Skip to content

Conversation

@morgolock
Copy link
Contributor

  • The Neon(TM) implementation of TopKV reduces execution time from 447.8 ms (scalar CPP) to 11.65 ms for the same workload (F32, C=1000, N=32000, k=3, 6 threads), achieving an approximate 38× speedup. This gain comes from SIMD vectorization, removal of per-element branches, and a more efficient inner loop.

  • Resolves ARMCL-1227

Change-Id: Ifdf161ce4254dc5ecd57aff9ae22410facd31705

*/
void configure(const ITensor *predictions, const ITensor *targets, ITensor *output, const unsigned int k);

/** Static function to check if given info will lead to a valid configuration of @ref CPPTopKVKernel
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CpuTopKVKernel

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in next patchset

void
configure(const ITensorInfo *predictions, const ITensorInfo *targets, ITensorInfo *output, const unsigned int k);

/** Static function to check if given info will lead to a valid configuration of @ref CPPTopKVKernel
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we just say "similar to @ref CpuTopKV::configure()" to reduce duplication? Have a look at arm_compute/runtime/experimental/operators/CpuActivation.h

Same comment is valid for every header

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in next patchset

#define ACL_TESTS_VALIDATION_FIXTURES_TOPKVLAYERFIXTURE_H

#include "arm_compute/core/TensorShape.h"
#include "arm_compute/core/Types.h"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need anything particular from this file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed them in next patch

@morgolock morgolock force-pushed the pr/cpu_topkv branch 4 times, most recently from 89cd2b8 to cd41038 Compare January 16, 2026 16:25

#include "src/core/common/Macros.h"
#include "src/cpu/ICpuKernel.h"

Copy link
Contributor

@gunes-arm gunes-arm Jan 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at more carefully, I think we need to include the STL vector, string and cstdint headers in this file; not inside the corresponding cpp. Also, type_traits for std::add_pointer.

Same goes for ITensorInfo.h, ITensor.h.

We also need to include

  • arm_compute/core/Window.h for Window
  • arm_compute/core/CPP/CPPTypes.h for CPUInfo
  • src/cpu/kernels/CpuKernelSelectionTypes.h for DataTypeSelector.. stuff
  • arm_compute/core/Error.h for Status

In a nutshell, this file should be self-sufficient and we don't need to repeat the same includes in the corresponding .cpp file for brevity.

Can you please review this file and CpuTopKVKernek.cpp accordingly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated in next patch

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

string and type_traits are not added here.
We should also add ITensorInfo.h, ITensor.h here and remove from CpuTopKVKernel.cpp.


for (unsigned int n = 0; n < N; ++n)
{
const uint32_t target = *reinterpret_cast<const uint32_t *>(targets(Coordinates{static_cast<int>(n)}));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

targets[i]??

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this not work?

@morgolock morgolock force-pushed the pr/cpu_topkv branch 3 times, most recently from 541f913 to e0090dc Compare January 20, 2026 17:43
* The Neon(TM) implementation of TopKV reduces execution time from 447.8 ms (CPP) to 11.65 ms for the same workload (F32, C=1000, N=32000, k=3, 6 threads), achieving an approximate 38× speedup. This gain comes from SIMD vectorization, removal of per-element branches, and a more efficient inner loop

* Resolves ARMCL-1227.

Change-Id: Ifdf161ce4254dc5ecd57aff9ae22410facd31705
Signed-off-by: Pablo Marquez Tello <pablo.tello@arm.com>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copyright should be 2026


#include "src/core/common/Macros.h"
#include "src/cpu/ICpuKernel.h"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

string and type_traits are not added here.
We should also add ITensorInfo.h, ITensor.h here and remove from CpuTopKVKernel.cpp.

// Load 8 fp16
const float16x8_t v16 = vld1q_f16(reinterpret_cast<const __fp16 *>(ptr));

// Compare in fp32 (same correctness story you already debugged)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do we mean by "you already debugged" here?


} // namespace detail

void topkv_fp32_neon(const ITensor *in1, const ITensor *in2, ITensor *out, uint32_t k, const Window &win)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we change these to predictions & targets in all classes and functions? (Except the operator list.dox unfortunately due to the convention there). There is a mix-up of in1/src0 in different files.

* |S32 |U32 |U8 |
* |F16 |U32 |U8 |
* |F32 |U32 |U8 |
* |U8 |U32 |U8 |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this U8 line extra here


#include "tests/framework/datasets/Datasets.h"

#include <type_traits>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is type_traits here? I think we only need , am I missing anything?


TEST_SUITE(NEON)
TEST_SUITE(TopKVLayer)
// clang-format on
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need these two lines.

TEST_SUITE(TopKVLayer)

template <typename T>
using CPPTopKVLayerFixture = TopKVValidationFixture<Tensor, Accessor, CPPTopKV, T>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that it was useful when comparing this implementation but we never have a CPP test suite in the NEON/ directory. The build system might miss to compile this test because we might ignore (we may already be ignoring) any files with under tests/validation/NEON when neon=0 option is provided to Scons.

I think these tests are valuable. Let's keep them but put them under tests/validation/CPP/TopKV.cpp


for (unsigned int n = 0; n < N; ++n)
{
const uint32_t target = *reinterpret_cast<const uint32_t *>(targets(Coordinates{static_cast<int>(n)}));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this not work?

/** Initialise the kernel's input, dst and border mode.
*
* @param[in] src0 First input tensor info. Data types supported: QASYMM8/QASYMM8_SIGNED/F16/F32/S32
* @param[in] src1 Second input tensor info. Data types supported: U32
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're missing the argument k here


// targets must match batch
// targets is expected to contain N elements (shape [N])
ARM_COMPUTE_RETURN_ERROR_ON_MSG(src1.num_dimensions() < 1, "targets must have at least 1 dimension");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we only need to copy what's already inside CPPTopKVKernel.cpp:

    ARM_COMPUTE_RETURN_ERROR_ON(predictions->num_dimensions() > 2);
    ARM_COMPUTE_RETURN_ERROR_ON(targets->num_dimensions() > 1);

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants